December 25, 2020
We are going to do a missing value analysis on a dataset. It's going to be full of R commands.
x = c("ggplot2", "corrgram", "DMwR", "caret", "randomForest", "unbalanced", "C50", "dummies", "e1071", "Information", "MASS", "rpart", "gbm", "ROSE")
Note: DMwR is the library required for KNN imputation.
setwd("/Users/mac/Downloads/")
T1 = read.csv("melb_data.csv", header = T, na.strings = c(" ", "", "NA"))
Note: the na.strings parameter of the read function tells R which symbols/characters need to be treated as NA values.
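To see what na.strings changes, here is a minimal sketch on a made-up CSV held in a string (the names csv_text, default_read and custom_read are only for illustration, and the rows are not from melb_data.csv). It reads the same text twice, once with the defaults and once with the na.strings used above, and counts the NA values in a text column.

```r
# Toy CSV held in a string (made-up data, not rows from melb_data.csv).
# The city column contains: "Kew", a single space, an empty field, and "NA".
csv_text <- "id,city\n1,Kew\n2, \n3,\n4,NA\n"

# Default behaviour: only the literal string "NA" is treated as missing.
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Same na.strings as in the post: space, empty string and "NA" all become NA.
custom_read <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                        na.strings = c(" ", "", "NA"))

sum(is.na(default_read$city))  # counts only the literal "NA" entry
sum(is.na(custom_read$city))   # also counts the blank and single-space entries
```

Without na.strings, blank cells in a text column would be read as ordinary strings and would not show up in the missing value counts.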
missing_val = data.frame(apply(T1, 2, function(x) {sum(is.na(x))}))
This calculates the total number of missing values for every column.
Here we use the apply function to avoid writing a loop.
Inside apply, we pass T1, then 2 (since we work at column level), and then an anonymous function that counts the number of missing values in each column.
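A small sketch of the same idea on a toy data frame (toy and its values are made up for illustration). It also shows two base R alternatives, sapply and colSums, which should give the same counts.

```r
# Toy data frame with a few missing values (made-up data).
toy <- data.frame(
  Rooms  = c(2, NA, 3, 4),
  Price  = c(NA, NA, 1.2, 0.9),
  Suburb = c("Kew", "Kew", NA, "Carlton"),
  stringsAsFactors = FALSE
)

# MARGIN = 2 means "apply the function to each column".
by_apply <- apply(toy, 2, function(x) sum(is.na(x)))

# Alternatives: sapply iterates over the columns of a data frame directly,
# and colSums adds up the TRUE values of the logical matrix is.na(toy).
by_sapply  <- sapply(toy, function(x) sum(is.na(x)))
by_colsums <- colSums(is.na(toy))

by_apply    # Rooms 1, Price 2, Suburb 1
```

All three return a named vector with one count per column, so any of them can feed the data.frame call above.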
missing_val$columns = row.names(missing_val) # Add the row names as a separate column
row.names(missing_val) = NULL # Reset the row names
names(missing_val)[1] = "missingpercentage" # Rename the first column to missing percentage
missing_val$missingpercentage = (missing_val$missingpercentage / nrow(T1)) * 100 # Convert counts to percentages
missing_val = missing_val[order(-missing_val$missingpercentage),] # Sort in descending order
missing_val = missing_val[,c(2,1)] # Put the column names first
write.csv(missing_val, "Missingperc.csv", row.names = F) # Save the result
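The steps above can also be wrapped in one small function. This is only a sketch: missing_summary is a hypothetical helper name, the toy data is made up, and it uses colMeans(is.na(df)) as a shortcut for "count, then divide by the number of rows".

```r
# Hypothetical helper: percentage of missing values per column, sorted descending.
missing_summary <- function(df) {
  pct <- colMeans(is.na(df)) * 100            # share of NA per column, in percent
  out <- data.frame(columns = names(pct),
                    missingpercentage = unname(pct),
                    stringsAsFactors = FALSE)
  out <- out[order(-out$missingpercentage), ] # largest percentage first
  row.names(out) <- NULL                      # reset row names after sorting
  out
}

# Toy data (made up): Price has 2 of 4 values missing, the others 1 of 4.
toy <- data.frame(
  Rooms  = c(2, NA, 3, 4),
  Price  = c(NA, NA, 1.2, 0.9),
  Suburb = c("Kew", "Kew", NA, "Carlton"),
  stringsAsFactors = FALSE
)

summary_tbl <- missing_summary(toy)
summary_tbl
```

On the real data the call would be missing_summary(T1), and the result could be passed to write.csv in the same way.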
Below are three methods of missing value imputation: mean, median and KNN. To choose between them, take one known value, remove it manually, impute it with each of the three methods, see which method's value comes closest to the actual value, and use that method for the analysis.
I am checking the "YearBuilt" variable, at the 81st row and 16th column; the answer is below.
Now I am manually setting it to NA and imputing it with all three methods.
T1$YearBuilt[is.na(T1$YearBuilt)] = mean(T1$YearBuilt, na.rm = T)
Refresh the data before proceeding to next method.
T1$YearBuilt[is.na(T1$YearBuilt)] = median(T1$YearBuilt, na.rm = T)
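Here is a minimal sketch of the "hide one value, impute, refresh, impute again" routine on a toy column (the data and the names toy, masked, mean_filled and median_filled are made up for illustration). Keeping a masked copy and starting each method from it is one way to refresh the data between methods.

```r
# Toy column (made-up values, not from melb_data.csv).
toy <- data.frame(YearBuilt = c(1890, 1950, 1960, 1970, 2000))

actual <- toy$YearBuilt[1]     # remember the true value
masked <- toy
masked$YearBuilt[1] <- NA      # hide it manually

# Mean method, starting from the masked copy.
mean_filled <- masked
mean_filled$YearBuilt[is.na(mean_filled$YearBuilt)] <-
  mean(mean_filled$YearBuilt, na.rm = TRUE)

# "Refresh": start again from the masked copy for the median method.
median_filled <- masked
median_filled$YearBuilt[is.na(median_filled$YearBuilt)] <-
  median(median_filled$YearBuilt, na.rm = TRUE)

c(actual = actual,
  mean   = mean_filled$YearBuilt[1],    # 1970
  median = median_filled$YearBuilt[1])  # 1965
```

If the second method were run on the already filled data, there would be no NA left to impute, which is why the refresh step matters.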
T1 = knnImputation(T1, k = 5)
If you get an error saying the knnImputation function could not be found, please install and load the "DMwR" library.
library("DMwR")
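To give a feel for what KNN imputation does, here is a simplified base R sketch. It is not DMwR's implementation, which may differ in details such as how neighbours are weighted. The function knn_fill and the houses data are hypothetical, and the sketch assumes the other columns are numeric and complete. For each row with a missing value, it finds the k most similar rows and averages their values.

```r
# Hypothetical helper: fill NAs in one numeric column using the mean of the
# k nearest rows (Euclidean distance on the scaled remaining columns).
knn_fill <- function(df, target, k = 5) {
  predictors <- setdiff(names(df), target)
  X <- scale(as.matrix(df[predictors]))       # put predictors on a common scale
  known <- which(!is.na(df[[target]]))        # rows that can act as neighbours
  for (i in which(is.na(df[[target]]))) {
    # Repeat row i so it can be subtracted from every candidate neighbour.
    row_i <- matrix(X[i, ], nrow = length(known), ncol = ncol(X), byrow = TRUE)
    d <- sqrt(rowSums((X[known, , drop = FALSE] - row_i)^2))
    nearest <- known[order(d)][seq_len(min(k, length(known)))]
    df[[target]][i] <- mean(df[[target]][nearest])
  }
  df
}

# Toy data (made up): the last house resembles the two oldest ones.
houses <- data.frame(
  Rooms     = c(2, 2, 3, 3, 4, 4, 2),
  Landsize  = c(150, 160, 400, 420, 700, 720, 155),
  YearBuilt = c(1890, 1900, 1960, 1970, 2000, 2010, NA)
)

filled <- knn_fill(houses, "YearBuilt", k = 2)
filled$YearBuilt[7]                       # 1895, the mean of the two nearest rows
mean(houses$YearBuilt, na.rm = TRUE)      # 1955, what the mean method would use
```

Because the neighbours are chosen by similarity on the other columns, the imputed value follows the local pattern in the data instead of the overall average.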
Now, the actual value at the 81st row and 16th column, the YearBuilt variable:
Actual value = 1890
We made this value NA and calculated the values below:
Using mean, the value is 1964
Using median, the value is 1970
Using KNN, the value is 1897
The nearest value is 1897, so we go with the KNN method for imputing all missing values.
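The comparison can be written down as absolute errors. A short sketch using the numbers reported above (the variable names are only for illustration):

```r
# Values reported in this post for row 81 of YearBuilt.
actual  <- 1890
imputed <- c(mean = 1964, median = 1970, knn = 1897)

# Absolute distance of each imputed value from the actual value.
abs_error <- abs(imputed - actual)
abs_error                     # mean 74, median 80, knn 7

# The method with the smallest error.
names(which.min(abs_error))   # "knn"
```

This is a check on a single value; repeating it for a few more known values would give a more reliable picture of which method fits the data.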
Hope this post helps! I will update you all with the next post soon!